Workshop Packages

library(tsibble)
library(fable)
library(tsibbledata)

Supplementary Packages

library(tidyverse)
library(lubridate)
library(ggplot2)
library(urca)
library(feasts)



Tsibble

Object creation with ‘tsibble()’

# Create a tsibble object from scratch
y <- tsibble(
  mth = rep(yearmonth("2010 Jan") + 0:8, each = 3),
  a_observation = rep(c("x", "y", "z"), each = 9),
  b_observation = rep(letters[1:3], times = 9),
  value = rnorm(27),
  key = c(a_observation, b_observation)
)
y

Handle implicit missing values

What are implicit missing values, and how do they differ from explicit missing values?

An explicit missing value is an NA recorded in the data; an implicit missing value is a gap in the observations, i.e. an index entry that is absent altogether. Many time series operations assume an intact, regularly spaced input ordered in time, so dealing with implicit missing values becomes relevant.
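To make the distinction concrete, here is a toy example (hypothetical data, not one of the workshop datasets): the March value is explicitly missing (NA), while April is implicitly missing because no row exists for it.

```r
library(tsibble)

toy <- tsibble(
  mth   = yearmonth("2010 Jan") + c(0, 1, 2, 4),  # Jan, Feb, Mar, May -- no Apr row
  value = c(1.2, 0.8, NA, 1.5),                   # Mar is an explicit NA
  index = mth
)
has_gaps(toy)  # .gaps is TRUE: 2010 Apr is implicitly missing
```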

Tsibble provides four functions that help to understand and deal with implicit missing values:

1. has_gaps() checks whether implicit missing values exist
2. scan_gaps() reports all implicit missing values
3. count_gaps() summarises the time ranges that are absent from the data
4. fill_gaps() turns implicit missing values into explicit ones, optionally imputing them with values or functions

The “pedestrian” dataset contains the hourly pedestrian counts from 2015-01-01 to 2016-12-31 at 4 sensors in the city of Melbourne.

df_pedestrian <- as.data.frame(pedestrian)
df_pedestrian

1. Let's turn the pedestrian data frame into a tsibble object using “Sensor” as the key and “Date_Time” as the index

tsbl_pedestrian <- df_pedestrian %>% 
  as_tsibble(key = Sensor, index = Date_Time)
tsbl_pedestrian

2. Explore implicit missing values with has_gaps(), scan_gaps(), and count_gaps()

has_gaps(tsbl_pedestrian, .full = TRUE)

As we can see, each sensor has gaps in time. Let's create a list of all implicit missing values:

scan_gaps(tsbl_pedestrian, .full = TRUE)

As we can see, there are 4,139 implicit missing observations in total. Let's investigate how they are distributed across the sensors!

ped_gaps <- tsbl_pedestrian %>% 
  count_gaps(.full = TRUE)
ped_gaps

Which sensor has the most missing observations?

ped_gaps %>% 
  group_by(Sensor) %>% 
  summarise(Total_missing = sum(.n))

3. Fill the gaps with fill_gaps()!

With NAs:

ped_full <- tsbl_pedestrian %>% 
  fill_gaps(.full = TRUE)
ped_full %>% 
  filter(is.na(Count))

All 4,139 previously identified implicit missing observations are now explicit rows with NA counts!

But what if we want to avoid NAs in order to conduct certain operations?

We can also fill the gaps with zeros or any other fixed value:

tsbl_pedestrian %>% 
  fill_gaps(Count = 0L, .full = TRUE) %>% 
  filter(Count == 0L)

We can see that there are 8 more rows than expected. These are original observations where Count was already 0, which our filter picks up along with the filled gaps.
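If that explanation is right, filtering the original (unfilled) data for zero counts should return exactly those 8 rows (a quick sanity check, assuming tsbl_pedestrian from above):

```r
tsbl_pedestrian %>% 
  filter(Count == 0) %>% 
  nrow()  # zero counts that existed before fill_gaps()
```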

We can also fill the gaps using a function. For example, we might consider it appropriate to fill each sensor's gaps with that sensor's mean count:

tidy_pedestrians <- tsbl_pedestrian %>% 
  group_by_key() %>% 
  fill_gaps(Count = mean(Count), .full = TRUE)

4. Let's find the sensor with the busiest day!

daily_ped <- tsbl_pedestrian %>% 
  group_by_key() %>% 
  # Aggregate the hourly index to a daily one
  index_by(date = ~ as_date(.)) %>% 
  summarise(
    ped_mean = mean(Count, na.rm = TRUE)
  ) 
daily_ped %>% arrange(desc(ped_mean))
daily_ped %>% filter(Sensor == "Birrarung Marr") %>% autoplot(ped_mean)


Fable

The “tourism” dataset contains quarterly numbers of overnight trips across Australian regions, states, and travel purposes.

tourism

1. Data Manipulation

We are interested in the total number of trips, aggregated over all regions, states, and purposes:

tourism_t_trips <- tourism %>% 
  summarise(Trips = sum(Trips))
tourism_t_trips

2. Data Exploration

tourism_t_trips %>% 
  autoplot(Trips)

What can we see? There is clear seasonality, and a negative trend that turns positive after 2010. As experienced forecasters, we know that under such conditions an exponential smoothing model is a reasonable choice.
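We can back that impression up with a decomposition; feasts (loaded above) provides STL, which splits the series into trend, seasonal, and remainder components (a sketch, assuming tourism_t_trips from above):

```r
tourism_t_trips %>% 
  model(STL(Trips)) %>% 
  components() %>% 
  autoplot()
```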

3. Comparing forecasting performance of multiple models

fit <- tourism_t_trips %>% 
  model(
    ets = ETS(Trips),
    arima = ARIMA(Trips),
    lm = TSLM(Trips ~ trend() + season())
  )

fit %>% 
  forecast(h = "2 years") %>% 
  autoplot(tourism_t_trips, level = 80, alpha = 0.5)

Train - Test Split

tourism_t_trips %>% 
  # Withhold the last 3 years before fitting the model
  filter(Quarter < yearquarter("2015 Q1")) %>% 
  # Estimate the models on the training data (1998-2014)
  model(
    ets = ETS(Trips),
    arima = ARIMA(Trips),
    lm = TSLM(Trips ~ trend() + season())
  ) %>% 
  # Forecast the withheld time period (2015-2017)
  forecast(h = "3 years") %>% 
  # Compute accuracy of the forecasts relative to the actual data 
  accuracy(tourism_t_trips)
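To pick a winner from the accuracy table, it helps to sort by a single error measure such as RMSE (a sketch repeating the split above):

```r
tourism_t_trips %>% 
  filter(Quarter < yearquarter("2015 Q1")) %>% 
  model(
    ets = ETS(Trips),
    arima = ARIMA(Trips),
    lm = TSLM(Trips ~ trend() + season())
  ) %>% 
  forecast(h = "3 years") %>% 
  accuracy(tourism_t_trips) %>% 
  # Lower RMSE = better out-of-sample fit
  arrange(RMSE)
```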